21 research outputs found
The language of sounds unheard: Exploring sensory semantic knowledge in large language models
Semantic dimensions of sound have been playing a central role in
understanding the nature of auditory sensory experience as well as the broader
relation between perception, language, and meaning. Accordingly, and given the
recent proliferation of large language models (LLMs), here we asked whether
such models exhibit an organisation of perceptual semantics similar to that
observed in humans. Specifically, we prompted ChatGPT, a chatbot based on a
state-of-the-art LLM, to rate musical instrument sounds on a set of 20 semantic
scales. We elicited multiple responses in separate chats, analogous to having
multiple human raters. ChatGPT generated semantic profiles that only partially
correlated with human ratings, yet showed robust agreement along well-known
psychophysical dimensions of musical sounds such as brightness (bright-dark)
and pitch height (deep-high). Exploratory factor analysis suggested the same
dimensionality but different spatial configuration of a latent factor space
between the chatbot and human ratings. Unexpectedly, the chatbot showed degrees
of internal variability that were comparable in magnitude to those of human
ratings. Our work highlights the potential of LLMs to capture salient
dimensions of human sensory experience. Comment: 12 pages, 3 figures
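The rater-style protocol described above (many independent chats, each treated as one rater, then aggregation against human ratings) can be sketched numerically. The sketch below uses synthetic numbers, not data from the study; the noise levels, scale range, and rater counts are assumptions for illustration only.

```python
import numpy as np

# Hypothetical illustration of the analysis: repeated chatbot ratings
# (one per chat) aggregated and correlated with mean human ratings on a
# semantic scale. All numbers are synthetic, not data from the paper.
rng = np.random.default_rng(0)

n_sounds, n_chats, n_humans = 8, 10, 20
true_profile = rng.uniform(1, 7, n_sounds)   # latent semantic profile

# Each separate chat acts like one "rater": the profile plus noise.
chat_ratings = true_profile + rng.normal(0, 0.8, (n_chats, n_sounds))
human_ratings = true_profile + rng.normal(0, 0.8, (n_humans, n_sounds))

chat_mean = chat_ratings.mean(axis=0)
human_mean = human_ratings.mean(axis=0)

# Pearson correlation between mean chatbot and mean human profiles.
r = np.corrcoef(chat_mean, human_mean)[0, 1]

# Internal variability: per-sound standard deviation across raters,
# the quantity the abstract compares between chatbot and humans.
chat_sd = chat_ratings.std(axis=0).mean()
human_sd = human_ratings.std(axis=0).mean()
```

With real data, `r` would be computed per semantic scale, and the spread of `chat_sd` against `human_sd` quantifies the unexpectedly human-like internal variability the abstract reports.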
The Responsibility Problem in Neural Networks with Unordered Targets
We discuss the discontinuities that arise when mapping unordered objects to
neural network outputs of fixed permutation, referred to as the responsibility
problem. Prior work has proved the existence of the issue by identifying a
single discontinuity. Here, we show that discontinuities under such models are
uncountably infinite, motivating further research into neural networks for
unordered data. Comment: Accepted for TinyPaper archival at ICLR 2023:
https://openreview.net/forum?id=jd7Hy1jRiv
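The discontinuity the abstract refers to can be made concrete with a toy construction (ours, for illustration, not the paper's proof): a model with a fixed output ordering must assign each output slot "responsibility" for one element of the unordered set, and that assignment cannot vary continuously.

```python
import numpy as np

# Toy responsibility-problem demo: the "model" orders an unordered pair
# by x-coordinate to fill its fixed output slots. As the set varies
# continuously, the slot assignment flips and the required output jumps.

def target_for_slots(point_a, point_b):
    """Fixed-order target: sort the unordered pair by x-coordinate."""
    pts = sorted([point_a, point_b], key=lambda p: p[0])
    return np.concatenate(pts)

def continuous_set(eps):
    """An unordered pair of 2D points that varies continuously with eps."""
    return (np.array([eps, 0.0]), np.array([-eps, 1.0]))

# Just below and just above eps = 0 the two sets are arbitrarily close...
left = target_for_slots(*continuous_set(-1e-6))
right = target_for_slots(*continuous_set(+1e-6))

# ...but the fixed-order targets differ by a fixed amount: a discontinuity.
jump = np.abs(left - right).max()
print(jump)  # 1.0: slot responsibilities have swapped
```

The paper's contribution is to show such discontinuities are not isolated: they are uncountably infinite under any fixed-permutation output.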
Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions
Diffusion models have shown promising results for a wide range of generative
tasks with continuous data, such as image and audio synthesis. However, little
progress has been made on using diffusion models to generate discrete symbolic
music, because this new class of generative models is not well suited to
discrete data and its iterative sampling process is computationally
expensive. In this work, we propose a diffusion model combined with a
Generative Adversarial Network, aiming to (i) alleviate one of the remaining
challenges in algorithmic music generation which is the control of generation
towards a target emotion, and (ii) mitigate the slow sampling drawback of
diffusion models applied to symbolic music generation. We first used a trained
Variational Autoencoder to obtain embeddings of a symbolic music dataset with
emotion labels and then used those to train a diffusion model. Our results
demonstrate the successful control of our diffusion model to generate symbolic
music with a desired emotion. Our model achieves several orders of magnitude
improvement in computational cost, requiring merely four time steps to
denoise, whereas current state-of-the-art diffusion models for symbolic music
generation require on the order of thousands of steps.
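Why four steps versus thousands dominates the cost can be seen from the shape of iterative denoising itself. The loop below is a generic sketch, not the paper's architecture; the denoiser is a stand-in whose only purpose is to count network evaluations.

```python
import numpy as np

# Generic ancestral-sampling sketch: each denoising step costs one network
# evaluation, so cutting the step count from thousands to four scales the
# sampling cost by the same factor. The denoiser here is a dummy stand-in.

calls = {"n": 0}

def dummy_denoiser(x, t):
    """Stand-in for a trained denoising network (counts its own calls)."""
    calls["n"] += 1
    return 0.9 * x

def sample(denoise_fn, n_steps, dim, rng):
    """Start from Gaussian noise and refine it n_steps times."""
    x = rng.normal(size=dim)
    for t in reversed(range(n_steps)):
        x = denoise_fn(x, t)                      # one network call per step
        if t > 0:                                 # re-inject a little noise,
            x = x + 0.01 * rng.normal(size=dim)   # but not at the last step
    return x

rng = np.random.default_rng(0)
x0 = sample(dummy_denoiser, n_steps=4, dim=16, rng=rng)
print(calls["n"])  # 4 network evaluations, versus thousands for standard DDPMs
```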
Interactive Neural Resonators
In this work, we propose a method for the controllable synthesis of real-time
contact sounds using neural resonators. Previous works have used physically
inspired statistical methods and physical modelling for object materials and
excitation signals. Our method incorporates differentiable second-order
resonators and estimates their coefficients using a neural network that is
conditioned on physical parameters. This allows for interactive dynamic control
and the generation of novel sounds in an intuitive manner. We demonstrate the
practical implementation of our method and explore its potential creative
applications.
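The second-order resonators at the core of the method can be illustrated with a single two-pole filter excited by an impulse. The mapping below from physical parameters (frequency, decay time) to filter coefficients is the standard textbook construction; it is hard-coded here where the paper's neural network would predict the coefficients, and the parameter values are illustrative.

```python
import numpy as np

# One second-order (two-pole) resonator driven by an impulsive "contact"
# excitation: y[n] = 2 r cos(w) y[n-1] - r^2 y[n-2] + x[n].
# Pole angle w sets the resonant frequency; pole radius r sets the decay.

def resonator(excitation, freq_hz, decay_s, sr=44100):
    """Apply a two-pole resonator with given frequency and decay time."""
    w = 2 * np.pi * freq_hz / sr           # pole angle (rad/sample)
    r = np.exp(-1.0 / (decay_s * sr))      # pole radius from decay time
    a1, a2 = 2 * r * np.cos(w), -r * r
    y = np.zeros_like(excitation)
    for n in range(len(excitation)):
        y[n] = excitation[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

x = np.zeros(2048)
x[0] = 1.0                                 # impulsive contact excitation
y = resonator(x, freq_hz=440.0, decay_s=0.05)
```

In the paper's setting such resonators are differentiable, so the coefficient-predicting network can be trained end-to-end and updated interactively as the physical conditioning parameters change.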
The Semantics of Timbre
Because humans lack a sensory vocabulary for auditory experiences, timbral qualities of sounds are often conceptualized and communicated through readily available sensory attributes from different modalities (e.g., bright, warm, sweet) but also through the use of onomatopoeic attributes (e.g., ringing, buzzing, shrill) or nonsensory attributes relating to abstract constructs (e.g., rich, complex, harsh). The analysis of the linguistic description of timbre, or timbre semantics, can be considered one way to study its perceptual representation empirically. In the most commonly adopted approach, timbre is considered as a set of verbally defined perceptual attributes that represent the dimensions of a semantic timbre space. Previous studies have identified three salient semantic dimensions for timbre along with related acoustic properties. Comparisons with similarity-based multidimensional models confirm the strong link between perceiving timbre and talking about it. Still, the cognitive and neural mechanisms of timbre semantics remain largely unknown and underexplored, especially when one looks beyond the case of acoustic musical instruments.
Composer Style-specific Symbolic Music Generation Using Vector Quantized Discrete Diffusion Models
Emerging Denoising Diffusion Probabilistic Models (DDPM) have become
increasingly utilised because of the promising results they have achieved in
diverse generative tasks with continuous data, such as image and sound
synthesis. Nonetheless, the success of diffusion models has not been fully
extended to discrete symbolic music. We propose to combine a vector quantized
variational autoencoder (VQ-VAE) and discrete diffusion models for the
generation of symbolic music with desired composer styles. The trained VQ-VAE
can represent symbolic music as a sequence of indexes that correspond to
specific entries in a learned codebook. Subsequently, a discrete diffusion
model is used to model the VQ-VAE's discrete latent space. The diffusion model
is trained to generate intermediate music sequences consisting of codebook
indexes, which are then decoded to symbolic music using the VQ-VAE's decoder.
The results demonstrate that our model can generate symbolic music in target
composer styles, matching the given style conditions with an accuracy of 72.36%.
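The vector-quantization step the abstract describes can be sketched directly: each encoder latent is replaced by the index of its nearest codebook entry, turning music into the discrete index sequence the diffusion model operates on. The codebook below is random, standing in for the trained one, and all sizes are illustrative.

```python
import numpy as np

# Sketch of the VQ-VAE discretisation step: nearest-neighbour lookup into
# a learned codebook. Here the codebook is random (illustrative only).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))     # 512 entries, 64 dims each

def quantize(latents):
    """Map each latent vector to the index of its nearest codebook entry."""
    # (n, 1, d) - (1, K, d) -> (n, K) squared L2 distances
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def decode_indices(indices):
    """Look generated indices back up for the VQ-VAE decoder's input."""
    return codebook[indices]

latents = rng.normal(size=(16, 64))       # 16 steps of encoder output
idx = quantize(latents)                   # discrete sequence for diffusion
recon = decode_indices(idx)               # embeddings fed to the decoder
```

The discrete diffusion model then generates sequences like `idx` directly, and `decode_indices` plus the decoder turn them back into symbolic music.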
Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis
Differentiable digital signal processing (DDSP) techniques, including methods
for audio synthesis, have gained attention in recent years and lend themselves
to interpretability in the parameter space. However, current differentiable
synthesis methods have not explicitly sought to model the transient portion of
signals, which is important for percussive sounds. In this work, we present a
unified synthesis framework aiming to address transient generation and
percussive synthesis within a DDSP framework. To this end, we propose a model
for percussive synthesis that builds on sinusoidal modeling synthesis and
incorporates a modulated temporal convolutional network for transient
generation. We use a modified sinusoidal peak picking algorithm to generate
time-varying non-harmonic sinusoids and pair it with differentiable noise and
transient encoders that are jointly trained to reconstruct drumset sounds. We
compute a set of reconstruction metrics using a large dataset of acoustic and
electronic percussion samples that show that our method leads to improved onset
signal reconstruction for membranophone percussion instruments. Comment: To be published in The Proceedings of Forum Acusticum, Sep 2023, Turin, Italy.
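Of the framework's components (sinusoids, noise, transients), the spectral part can be sketched as a bank of non-harmonic sinusoids plus noise under a decay envelope. The partial frequencies, amplitudes, and envelope below are illustrative values, not the model's learned outputs, and the learned transient component is omitted.

```python
import numpy as np

# Sketch of the spectral half of a sinusoids-plus-noise synthesiser:
# fixed non-harmonic partials (as for a drum's modes) plus white noise,
# both shaped by a global exponential decay. Values are illustrative.

def sinusoids_plus_noise(freqs, amps, noise_gain, n, sr=44100):
    """Sum of non-harmonic sinusoids plus noise under a decay envelope."""
    t = np.arange(n) / sr
    tone = sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))
    noise = noise_gain * np.random.default_rng(0).normal(size=n)
    env = np.exp(-6.0 * t)                # global decay envelope
    return env * (tone + noise)

# Non-harmonic partials, loosely drum-like (illustrative frequencies).
y = sinusoids_plus_noise([180.0, 421.0, 697.0], [1.0, 0.6, 0.3],
                         noise_gain=0.1, n=8192)
```

In the paper's differentiable setting, the sinusoid parameters come from a modified peak-picking algorithm and the noise and transient encoders are trained jointly; this sketch only shows the signal model being fitted.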